Abstract

Time-course gene expression datasets, which record continuous biological processes of genes, have recently been used to
predict gene function. However, only a few positive genes can be obtained from annotation databases, such as gene
ontology (GO). To obtain more useful information and effectively predict gene function, gene annotations are clustered
together to form a learnable and effective learning system. In this paper, we propose a novel multi-instance hierarchical
clustering (MIHC) method to establish a learning system by clustering GO terms and compare this method with other learning
system establishment methods. Multi-label support vector machine classifier and multi-label K-nearest neighbor classifier
are used to verify these methods in four yeast time-course gene expression datasets. The MIHC method shows good
performance and can serve as a guide to annotators or help refine the annotation in detail.
Introduction

Genes are annotated in gene annotation databases [e.g., gene
ontology (GO), KEGG, and MIPS], but the rate of gene
identification is faster than that of gene annotation. Given the large
number of identified genes, predicting the functions of un-annotated
genes is a challenge. To date, many effective machine
learning techniques have been proposed. However, function prediction
differs from common machine learning tasks. A gene may
have multiple functions, and a function may be shared by a set of genes.
Function prediction is therefore a multi-label learning (MLL)
task, whereas the common machine learning task is single-instance
single-label learning. Therefore, establishing an effective and
learnable learning system for learning machines is necessary.

Different types of data call for different learning
approaches. We choose yeast time-course gene expression datasets
because they record gene responses to various environments.
Therefore, when searching for functions of genes according to
their involvement in biological processes, measurements of
changes in gene expression throughout the time course of a given
biological response are particularly interesting [1].

Gene function prediction methods for different purposes can be
grouped into supervised and unsupervised methods. Unsupervised 
methods (i.e., clustering) do not usually use existing biological 
knowledge to find gene expression patterns. Eisen et al. [2] 
discovered classes of expression patterns and identified groups of 
genes that are regulated similarly. Ernst et al. [3,4] clustered short 
time series gene expression data using a predefined expression 
model. Ma et al. [5] used a data-driven method to cluster
time-course gene expression data. Other popular clustering algorithms
include hierarchical clustering (HC), K-means clustering, and
self-organizing maps [6]. Supervised methods (i.e., classification) use
existing biological knowledge, such as GO, to create classification 
models. Lagreid et al. [1] applied the If-Then Rule Model to 
recognize the biological process from gene expression patterns. 
GENEFAS [7] predicted functions of un-annotated yeast genes 
using a functional association network based on annotated genes. 
Clare [8] presented a hierarchical multi-label classification (HMC) 
decision tree method to predict Saccharomyces cerevisiae gene 
functions. Schietgat et al. [9] presented an ensemble method
(i.e., CLUS-HMC-ENS), which learns multiple trees for predicting
gene functions of yeast. Kim et al. [10] combined the predictions
of functional networks with predictions from a Naive Bayes
classifier. Vazquez et al. [11] predicted global protein function
from protein-protein interaction networks. Deng et al. [12] 
predicted gene functions with Markov random fields using protein 
interaction data. Nabieva et al. [13] proposed the functional flow 
method, which is a network-flow based algorithm, to predict 
protein function with few annotated neighbors. Recently, Magi et 
al. [14] annotated gene products using weighted functional 
networks. Liang et al. [15] predicted protein function using 
overlapping protein networks. Mitsakakis et al. [16] predicted 
Drosophila melanogaster gene function using the support vector 
machines (SVMs). 

The present study predicts gene function based on the 
assumption that genes participating in the same biological 
processes have similar expression profiles. We initially produce a 
non-noise system by selecting genes. Then, the multi-instance 
hierarchical clustering (MIHC) method is proposed to establish a 
learning system. Finally, multi-label support vector machine 
(MLSVM) and multi-label K-nearest neighbor (MLKNN) classifiers
are used to predict the function of genes in time-course
expression profiles. The experiment demonstrates the feasibility and
efficiency of the proposed method.

Figure 1. Three types of learning task. (a) A gene is treated as a sample and owns one GO term only, which is called single-instance
single-label. (b) A gene is treated as a sample and annotated by multiple GO terms. This relationship between gene and GO terms is multi-label. (c) Multiple
genes are treated as samples and share the same GO term. The relationship between genes and GO term is called multi-instance.
doi:10.1371/journal.pone.0090962.g001

Materials and Methods 

Gene function prediction 

In the GO database, the GO terms are organized as a directed
acyclic graph (DAG). In the GO hierarchical structure, the genes
are annotated at various levels of abstraction. When genes are
annotated with the GO terms, the genes are annotated with the
highest possible level of detail, which corresponds to the lowest
level of abstraction [17]. Therefore, the goal of gene function
prediction is to help annotators annotate genes with GO terms at the
highest level of detail. However, we can only obtain extremely few
positive genes with similar GO terms, and little information is
available for a machine-learning system. To obtain more positive
genes and efficiently predict gene function, many researchers
up-propagate gene annotations along the GO hierarchical structure and
establish a learning system. Up-propagation approaches fall broadly
into two groups: clustering genes to a certain GO level [8,9] and
clustering genes until each GO term owns a certain number of genes [18].

Figure 2. GO terms in the last level up-propagate along the GO DAG. (a) GNC: the bold GO terms all own at least λ genes. (b) GOLC: the bold GO terms are in the
preset level ℓ of the GO DAG.
doi:10.1371/journal.pone.0090962.g002

Multi-instance learning (MIL) and MLL 

Zhou et al. [19] provided a detailed description of MIL and
MLL. MIL learns a function $f_{MIL}: 2^{I} \rightarrow \{-1,+1\}$ from
the dataset $\{(X_i, y_i) \mid X_i \subseteq I,\ y_i \in \{-1,+1\}\}$, where
$I = \{x_i \mid i = 1,\ldots,n\}$, and MLL learns a
function $f_{MLL}: I \rightarrow 2^{L}$ from the dataset
$\{(x_i, Y_i) \mid x_i \in I,\ Y_i \subseteq L\}$, where
$L = \{y_i \mid i = 1,\ldots,n\}$.

The relationships between genes and annotations in the GO
database are illustrated in Figure 1. Figure 1(b) shows that a gene is
annotated by multiple GO terms, and Figure 1(c) shows that the
genes sharing a GO term are treated as instances of one sample
labeled with that term, so the GO term can be represented by those genes. Therefore,
the relationships between gene and GO shown in Figures 1(b) and
1(c) are called multi-label and multi-instance, respectively.
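
The two views of the same annotation data can be made concrete with a small sketch. The gene names and GO identifiers below are hypothetical toy values, not taken from the datasets used in this study; the function simply inverts the multi-label view (gene to GO terms) into the multi-instance view (GO term to bag of genes).

```python
from collections import defaultdict

# Hypothetical toy annotations (multi-label view): gene -> set of GO terms.
gene_to_terms = {
    "geneA": {"GO:1", "GO:2"},
    "geneB": {"GO:2"},
    "geneC": {"GO:2", "GO:3"},
}


def to_multi_instance(gene_to_terms):
    """Invert gene -> terms into term -> bag of genes (multi-instance view)."""
    term_to_genes = defaultdict(set)
    for gene, terms in gene_to_terms.items():
        for term in terms:
            term_to_genes[term].add(gene)
    return dict(term_to_genes)


term_to_genes = to_multi_instance(gene_to_terms)
```

In the multi-instance view, each GO term plays the role of one sample and the genes in its bag are the instances of that sample.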

Input:  D  - yeast expression dataset
        G  - annotated gene set
        L  - GO term set of G
        UG - un-annotated gene set
Output: Y  - predicted annotation set of UG

Data process:
  1  Let the first value of D always be equal to 0, smooth out spikes, and obtain D'.
  2  Using G, L, and the Pearson correlation of D', establish the non-noise system
     S = {(G_i, GO_i) | i = 1,...,M}.

Establish learning system using MIHC
TS = MIHC(S):
  3  Initialize the label set GOTerms = {GO_i | GO_i ∈ S}
  4  For GO_i, GO_j ∈ GOTerms
  5      Calculate D(GO_i, GO_j), D(GO_i) = Corr(G_i), and D(GO_j) = Corr(G_j) using Eqs.
         (5) and (6)
  6      If D(GO_i, GO_j) < max(D(GO_i), D(GO_j)) then
  8          Add the merged pair (G_ij, GO_ij), with G_ij = G_i ∪ G_j, to S
  9          Remove (G_i, GO_i) and (G_j, GO_j) from S
 10      End
 11      If S no longer changes, return S
 12  End

Prediction function
 13  Y = MLSVM(TS, UG) or Y = MLKNN(TS, UG)

Figure 3. MIHC algorithm and flow chart of function prediction.
doi:10.1371/journal.pone.0090962.g003

MLSVM

SVM is an effective machine learning method. For classification
problems, SVM implements a large margin classifier by solving a
quadratic optimization program on the basis of the principle of
structural risk minimization. Li et al. [20] adjusted the SVM to
multi-label classification by improving the quadratic optimization
program. Suppose $(x_i, Y_i)$ is a training sample, where $x_i$ is the
feature vector and $Y_i$ is the label set of the sample. Let $\psi(x_i, y) = 1$ if $y \in Y_i$
and $\psi(x_i, y) = -1$ otherwise. The SVM classification
model is then described by the following optimization problem:

$$
\begin{aligned}
\min_{w_y,\, b_y,\, \xi_{iy}} \quad & \frac{1}{2}\|w_y\|^2 + C \sum_{i=1}^{m} \tau_{iy}\, \xi_{iy} \\
\text{s.t.} \quad & \psi(x_i, y)\,\bigl[(w_y \cdot \varphi(x_i)) + b_y\bigr] \ge 1 - \xi_{iy}, \quad i = 1,\ldots,m \\
& \xi_{iy} \ge 0, \quad i = 1,\ldots,m
\end{aligned}
\tag{1}
$$

where $w_y \cdot \varphi(x_i)$ is the inner product, $\varphi(x_i)$ is the function that
maps $x_i$ to a higher dimensional space $\mathcal{H}$, $w_y$ and $b_y$ are the
parameters representing a linear discriminant function in $\mathcal{H}$,
$\xi_{iy}$ is the non-negative slack variable introduced in the constraints
to permit some training samples to be misclassified, $C$ is the
parameter that trades off the model complexity against the training loss, and $\tau_{iy}$ is the
amplification coefficient of the loss $\xi_{iy}$ for handling the class
imbalance problem [20,21].
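
To illustrate how the amplification coefficient changes the objective in Eq. (1), the following minimal sketch evaluates the objective for one class and a fixed linear discriminant. It assumes the identity feature map $\varphi(x) = x$ and uses hypothetical toy samples and coefficient values; it only evaluates the objective and is not the solver used in [20].

```python
def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def weighted_objective(w, b, samples, C, tau):
    """Value of the Eq. (1) objective for one class y and fixed (w, b).

    samples: list of (x, psi) pairs with psi in {+1, -1}
    tau:     amplification coefficients, one per sample (hypothetical values)
    For fixed (w, b), the smallest feasible slack of a sample is
    xi = max(0, 1 - psi * (w . x + b)).
    """
    slack_sum = 0.0
    for (x, psi), t in zip(samples, tau):
        xi = max(0.0, 1.0 - psi * (dot(w, x) + b))
        slack_sum += t * xi
    return 0.5 * dot(w, w) + C * slack_sum


# Toy data: two positive samples and one negative sample.
samples = [([2.0, 0.0], +1), ([0.5, 0.0], +1), ([-1.0, 0.0], -1)]
w, b = [1.0, 0.0], 0.0

uniform = weighted_objective(w, b, samples, C=1.0, tau=[1.0, 1.0, 1.0])
boosted = weighted_objective(w, b, samples, C=1.0, tau=[3.0, 3.0, 1.0])
```

With uniform coefficients, only the second positive sample violates the margin and contributes a slack of 0.5. Tripling the coefficients of the positive samples triples the cost of that violation, which is how the weighting pushes the classifier toward the minority class.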
Table 1. Number of genes and classes in each learning system.

                     Gene number                        Class number
Method  Parameter    alpha   cdc15   cdc28   elution    alpha   cdc15   cdc28   elution
GNC     λ = 10       1213    1284    224     2089       190     204     27      281
        λ = 20       1216    1294    221     2050       119     129     19      166
        λ = 30       1215    1297    236     2027       84      89      18      122
        λ = 40       1180    1267    204     2022       56      61      10      99
        λ = 50       1205    1207    183     2038       51      42      4       80
GOLC    ℓ = 1        1334    1417    261     2269       16      15      13      15
        ℓ = 2        1332    1417    261     2269       49      44      29      51
        ℓ = 3        1324    1407    261     2263       177     182     85      209
        ℓ = 4        1278    1341    241     2179       335     342     106     381
MIHC    null         1334    1417    261     2269       74      61      24      29

doi:10.1371/journal.pone.0090962.t001
Figure 4. Average AUC obtained from each learning system by MLSVM in all datasets.
doi:10.1371/journal.pone.0090962.g004

Figure 5. Average AUC obtained from each learning system by MLKNN for each expression dataset.
doi:10.1371/journal.pone.0090962.g005

Compared with the model proposed by Vapnik [22], the
aforementioned model performs better in multi-label classification.
Generally, multi-label classification is transformed into multiple
binary classifications. The class imbalance problem is a considerable
barrier for each binary classification. The parameter $\tau_{iy}$ in Eq. (1)
addresses this problem with good performance.

MLKNN

The K-nearest neighbor (KNN) method is another stable and popular machine
learning method. It performs classification more rapidly
than the SVM. Zhang et al. [23] improved the KNN method for multi-label
classification, which motivated the model used here.
In the MLKNN model, the candidate classes of a given test
sample $t_i$ are obtained by

$$ Y_{tc} = \{ y \mid \psi(x_i, y) = 1,\ x_i \in KNN(t_i),\ y \in Y \} \tag{2} $$

where $KNN(t_i)$ is the set of k nearest neighbors of $t_i$ in the training
set $S$. For each candidate class $y \in Y_{tc}$, the following likelihood
score is calculated:

$$ Score(t_i, y) = \sum_{s_i \in KNN(t_i)} simScore(t_i, s_i)\, \psi(s_i, y) \tag{3} $$

where $simScore(t_i, s_i)$ is the similarity score of $s_i$ to $t_i$. The labels of
$t_i$ are calculated by

$$ Y_{t_i} = \{ y \mid Score(t_i, y) > 0 \} \tag{4} $$
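
The three steps in Eqs. (2) to (4) can be sketched as follows. The text does not state which similarity score is used, so this sketch assumes the Pearson correlation between expression profiles and takes the k most similar training samples as the neighbors; the toy profiles and labels are hypothetical.

```python
import math


def pearson(u, v):
    """Pearson correlation of two equally long profiles (0.0 if one is constant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0


def mlknn_predict(t, train, k, sim=pearson):
    """Predict the label set of test profile t.

    train: list of (profile, label_set) pairs
    sim:   similarity score (assumed here to be the Pearson correlation)
    """
    # k nearest neighbours = the k most similar training samples.
    neighbours = sorted(train, key=lambda s: sim(t, s[0]), reverse=True)[:k]
    # Eq. (2): candidate classes are the labels carried by the neighbours.
    candidates = set().union(*(labels for _, labels in neighbours))
    predicted = set()
    for y in candidates:
        # Eq. (3): similarity-weighted vote with psi = +1 or -1.
        score = sum(sim(t, x) * (1 if y in labels else -1)
                    for x, labels in neighbours)
        if score > 0:  # Eq. (4)
            predicted.add(y)
    return predicted


train = [
    ([0, 1, 2, 3], {"A"}),
    ([0, 2, 1, 3], {"A", "B"}),
    ([3, 2, 1, 0], {"C"}),
]
result = mlknn_predict([0, 1, 2, 3], train, k=2)
```

In this toy case both neighbors carry label A, so its score is positive, whereas label B is supported only by the less similar neighbor and is rejected.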



Table 2. Average AUC obtained from cdc28 dataset by MLSVM.

                     Parameter setting for n%
Method  Parameter    10%    20%    30%    40%    50%    60%    70%    80%    90%
GNC     λ = 10       0.559  0.606  0.631  0.644  0.656  0.671  0.670  0.679  0.703
        λ = 20       0.569  0.603  0.607  0.625  0.638  0.646  0.650  0.650  0.661
        λ = 30       0.571  0.583  0.599  0.612  0.618  0.629  0.628  0.632  0.646
        λ = 40       0.529  0.543  0.552  0.567  0.577  0.574  0.578  0.589  0.602
        λ = 50       0.532  0.535  0.558  0.569  0.572  0.579  0.588  0.592  0.596
GOLC    ℓ = 1        0.594  0.617  0.634  0.635  0.648  0.643  0.641  0.651  0.653
        ℓ = 2        0.609  0.623  0.624  0.631  0.624  0.632  0.635  0.640  0.643
        ℓ = 3        0.601  0.644  0.657  0.658  0.654  0.661  0.656  0.643  0.668
        ℓ = 4        0.601  0.638  0.647  0.651  0.654  0.663  0.658  0.656  0.662
MIHC    null         0.621  0.666  0.727  0.767  0.800  0.794  0.817  0.828  0.838

Gene selection

We are not interested in all genes in the gene expression profiles.
In gene function prediction, we assume that genes participating in
the same biological processes have similar expression profiles
[2,24]. For this purpose, we select genes that are significantly
correlated with each other within the same function. Let
$G = \{gene_i \mid i = 1,\ldots,N\}$, $L = \{GO_i \mid i = 1,\ldots,M\}$, and
$G(GO_i) = \{gene_j \mid gene_j \in G \text{ and annotated by } GO_i\}$, where $N$ is the number
of genes and $M$ is the number of GO terms. For each $GO_i \in L$, we
draw a graph $graph_i = (v_i, e_i)$ for genes that significantly correlate
with each other. Each $gene_j \in G(GO_i)$ is a vertex in $v_i$ of $graph_i$. An
edge exists between $gene_j$ and $gene_k$ if $gene_j$ and $gene_k$ are
significantly correlated. We define $graph_i^{max} = (v_i^{max}, e_i^{max})$, which
is the maximum clique of $graph_i = (v_i, e_i)$, and $G_i = \{v \mid v \in v_i^{max}\}$.
However, the maximum clique problem is NP-hard [25-27].
In this paper, a greedy algorithm is used to deal with this
problem, and the non-noise system of expression data and
annotation is represented as $S = \{(G_i, GO_i) \mid i = 1,\ldots,M\}$.
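
The gene selection step can be sketched as a greedy clique search on the correlation graph of one GO term. The text does not specify the greedy strategy or the significance test, so this sketch makes two assumptions: two genes are linked when the absolute Pearson correlation of their profiles reaches a hypothetical `threshold`, and genes are visited in order of decreasing degree. Gene names and profiles are toy values.

```python
import math


def pearson(u, v):
    """Pearson correlation of two equally long profiles (0.0 if one is constant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0


def greedy_clique(genes, profiles, threshold):
    """Greedy approximation of a maximum clique in the correlation graph."""
    def linked(a, b):
        # Assumed edge rule: |Pearson correlation| >= threshold.
        return abs(pearson(profiles[a], profiles[b])) >= threshold

    degree = {g: sum(linked(g, h) for h in genes if h != g) for g in genes}
    clique = []
    # Visit genes from highest to lowest degree and keep a gene only if it
    # is linked to every gene already chosen.
    for g in sorted(genes, key=lambda g: -degree[g]):
        if all(linked(g, c) for c in clique):
            clique.append(g)
    return clique


profiles = {
    "g1": [0, 1, 2, 3],
    "g2": [0, 1, 2, 3.5],
    "g3": [0, 2, 4, 6],
    "g4": [3, 0, 2, 1],
}
selected = greedy_clique(["g1", "g2", "g3", "g4"], profiles, threshold=0.9)
```

Here `g4` is only weakly correlated with the other three genes, so it is excluded and the remaining genes form the selected group for that GO term. A greedy search of this kind returns a clique that cannot be extended, but not necessarily the largest one.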

Table 3. Average AUC obtained from cdc28 dataset by MLKNN.

                     Parameter setting for n%
Method  Parameter    10%    20%    30%    40%    50%    60%    70%    80%    90%
GNC     λ = 10       0.550  0.576  0.591  0.599  0.603  0.609  0.619  0.619  0.620
        λ = 20       0.556  0.571  0.581  0.585  0.586  0.589  0.591  0.591  0.595
        λ = 30       0.548  0.559  0.567  0.568  0.571  0.571  0.575  0.578  0.575
        λ = 40       0.548  0.557  0.563  0.565  0.569  0.569  0.571  0.567  0.571
        λ = 50       0.544  0.552  0.556  0.559  0.560  0.564  0.564  0.564  0.567
GOLC    ℓ = 1        0.541  0.547  0.540  0.546  0.547  0.547  0.547  0.547  0.554
        ℓ = 2        0.548  0.557  0.558  0.565  0.563  0.569  0.569  0.566  0.566
        ℓ = 3        0.544  0.573  0.584  0.588  0.585  0.589  0.589  0.585  0.582
        ℓ = 4        0.552  0.580  0.591  0.592  0.592  0.594  0.592  0.591  0.594
MIHC    null         0.655  0.733  0.741  0.743  0.765  0.774  0.776  0.777  0.774

Learning system establishment method

Prior to the prediction of gene function, we establish a learning
system for classification. Learning system establishment is the
reconstitution of gene labels. The GO DAG and MIPS are usually used
to aid the establishment of learning systems. Clare et al. [8] and
Schietgat et al. [9] established a MIPS-based learning system.
Based on the GO DAG, we use the same approach as in [8] and
[9]. We call this method GO level clustering (GOLC), which
up-propagates the gene annotations to a preset GO level ℓ, such as the
first level (i.e., ℓ = 1) of the GO DAG, and clusters the genes. In another
approach, Hvidsten et al. [18] used a method, here called gene
number clustering (GNC), to establish the learning system. The
GNC method lets the annotations up-propagate along the GO
DAG until each annotation has at least λ genes (λ = 10, 20 in [18]).
Figure 2 shows the two aforementioned methods.
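
As a rough illustration of the GNC idea, the sketch below moves the genes of every term that owns fewer than λ genes to its parent terms and repeats until no such term with a parent remains. This is only one possible reading of the description above and is not claimed to be the procedure of [18]; the toy DAG, the term names, and the treatment of root terms (left as they are) are assumptions.

```python
def gnc(term_genes, parents, lam):
    """Up-propagate genes of small terms to their parents.

    term_genes: dict term -> set of genes annotated with the term
    parents:    dict term -> list of parent terms in the DAG
    lam:        minimum number of genes a non-root term must own
    """
    groups = {t: set(g) for t, g in term_genes.items()}
    changed = True
    while changed:
        changed = False
        for term in list(groups):
            ups = parents.get(term, [])
            if len(groups[term]) < lam and ups:
                genes = groups.pop(term)
                for p in ups:
                    groups.setdefault(p, set()).update(genes)
                changed = True
    return groups


# Hypothetical toy DAG: two leaf terms under one parent term.
parents = {"c1": ["p"], "c2": ["p"], "p": ["root"]}
term_genes = {"c1": {"g1", "g2"}, "c2": {"g3"}, "p": {"g4"}}
clusters = gnc(term_genes, parents, lam=3)
```

Because genes only ever move toward the root of an acyclic graph, the loop terminates. In a general DAG the outcome can depend on the order in which terms are visited, so a production version would need an explicit bottom-up ordering.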

Using MIHC to Predict Yeast Gene Function

Figure 6. The ROC of a class obtained from the MIHC learning system by MLSVM for the cdc28 dataset. The ROC curves of the 20
repetitions of the experiment as well as the four subplots (a), (b), (c), and (d) with parameter n% = 20%, 40%, 60%, and 80%, respectively, are shown.
doi:10.1371/journal.pone.0090962.g006

MIHC method

HC is a widely used clustering technique
in machine learning. Johnson [28] proposed the extensively
studied hierarchical clustering scheme (HCS). The HCS initializes
all sample dissimilarities and then forms a cluster from the two
closest samples or clusters. These steps are repeated until all
samples are clustered into one group. Therefore, we can set a
termination criterion to stop the clustering rather than preset the number of
groups. Thus, HCS is suitable for a wide range of datasets.

To establish a more effective and efficient learning system, we
adopt HCS and propose the novel MIHC method to establish a
new learning system with the inherent characteristics of the non-noise
system S by clustering GO terms. In this method, we treat the
relationship between $GO_i$ and $gene_j \in G_i$ as multi-instance. Our samples
(i.e., GO terms) differ from those of traditional HC [28-30] because
they are multi-instance samples rather than single instances. Therefore, the distance
between samples is redesigned. According to [31], we define the
distance as follows:

$$ Corr(G_i) = \sum \left| corr(gene_p, gene_q) \right|, \quad \text{where } gene_p, gene_q \in G_i \tag{5} $$

$$ D(GO_i, GO_j) = Corr(G_i \cup G_j) \tag{6} $$

where $corr(gene_p, gene_q)$ is the Pearson correlation of $gene_p$ and
$gene_q$. Figure 3 shows the MIHC algorithm and flow chart of
function prediction.
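
The quantities in Eqs. (5) and (6) can be computed as in the following sketch. It assumes that the sum in Eq. (5) runs once over every unordered pair of distinct genes in the group. The optional `average` flag is a hypothetical variant, not part of the equations above, that divides by the number of pairs so that groups of different sizes can be compared; gene names and profiles are toy values.

```python
import math
from itertools import combinations


def pearson(u, v):
    """Pearson correlation of two equally long profiles (0.0 if one is constant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0


def corr_score(genes, profiles, average=False):
    """Eq. (5): sum of |corr| over all unordered gene pairs in the group."""
    pairs = list(combinations(sorted(genes), 2))
    total = sum(abs(pearson(profiles[p], profiles[q])) for p, q in pairs)
    if average:  # hypothetical normalised variant, not in Eq. (5)
        return total / len(pairs) if pairs else 0.0
    return total


def term_distance(genes_i, genes_j, profiles, average=False):
    """Eq. (6): score of the union of the two gene groups."""
    return corr_score(set(genes_i) | set(genes_j), profiles, average)


profiles = {
    "x1": [1, 1, -1, -1],
    "x2": [2, 2, -2, -2],   # perfectly correlated with x1
    "x3": [1, -1, 1, -1],   # uncorrelated with x1 and x2
}
within = corr_score(["x1", "x2"], profiles)
between = term_distance(["x1", "x2"], ["x3"], profiles)
between_avg = term_distance(["x1", "x2"], ["x3"], profiles, average=True)
```

Because every term of the plain sum is non-negative, the score of a union is never smaller than the score of either group on its own, which is why a size-normalized variant is offered here as an option.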



Results and Discussion 

Table 4. Results of the TP, FP, TN, and FN in MIHC by MLSVM.

                                 First experiment results     Second experiment results
n%     Train/Test    +      -    TP     FP     TN     FN      TP     FP     TN     FN
10%    Train         5      5
       Test          44     207  19     45     162    25      22     30     177    22
20%    Train         10     10
       Test          39     202  25     31     171    14      29     38     164    10
30%    Train         15     15
       Test          34     197  24     40     157    10      26     44     153    8
40%    Train         20     20
       Test          29     192  23     51     141    6       21     68     124    8
50%    Train         25     25
       Test          24     187  17     54     133    7       20     55     132    4
60%    Train         29     29
       Test          20     183  17     44     139    3       15     44     139    5
70%    Train         34     34
       Test          15     178  11     41     137    4       12     50     128    3
80%    Train         39     39
       Test          10     173  7      37     136    3       8      36     137    2
90%    Train         44     44
       Test          5      168  4      40     128    1       5      36     132    0

'+' means positive sample number; '-' means negative sample number.

Data

The yeast time-course expression datasets used in this study were
obtained from [32] (downloaded from
http://genome-www.stanford.edu/cellcycle/data/rawdata/). The four datasets are
yeast cell cycle expression data with different time points and
circumstances. Following the method in [3], we preprocess the raw data
so that the first value is always equal to zero. Then, the average
transformation $x'_i = (x_i + x_{i-1})/2$ of consecutive time points is used to smooth out the spikes.
Gene annotation data were obtained from GO [33] (downloaded
from http://www.geneontology.org/GO.downloads.annotations.shtml).
GO terms are organized into three disjoint
DAGs, namely, biological process (BP), molecular function, and
cellular component. We only use BP in this study because it is
more complete than the other two DAGs.
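
The preprocessing of one expression profile can be sketched as follows. Two details are assumptions rather than statements from the text: the first value is set to zero by subtracting it from every time point (which presumes values on a log-ratio scale), and the first point is left at zero during smoothing because it has no predecessor.

```python
def preprocess(series):
    """Shift a profile so its first value is 0, then average consecutive points."""
    # Assumed normalisation: subtract the first value from every time point.
    shifted = [v - series[0] for v in series]
    # The first point has no predecessor and stays at zero (assumption).
    smoothed = [shifted[0]]
    for i in range(1, len(shifted)):
        smoothed.append((shifted[i] + shifted[i - 1]) / 2)
    return smoothed


profile = preprocess([2.0, 4.0, 10.0, 4.0])
```

For this toy profile, the spike at the third time point (8 after shifting) is reduced to 5 by the averaging step.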

Table 5. Average TPR and FPR in MIHC by MLSVM.

n%     TPR      FPR
10%    0.713    0.286
20%    0.733    0.255
30%    0.760    0.250
40%    0.757    0.264
50%    0.783    0.257
60%    0.808    0.272
70%    0.783    0.241
80%    0.745    0.234
90%    0.810    0.243

Performance evaluation

Leave-one-out and leave-a-percent-out cross validation [14]
are two of the most extensively used approaches for
evaluating the performance of a function prediction algorithm.
The former is usually used for a small dataset, whereas the latter is
more suitable for a large dataset. The former method randomly
leaves one sample of the experiment dataset out for testing and uses
all of the other samples for training. This process is repeated many
times. Meanwhile, the latter method splits the experiment dataset
into two sets, namely, the training and testing sets. The training set
is composed of a specified proportion of positive and negative
samples, whose labels are known. Conversely, the labels of the
testing set are concealed from the classifiers. The proportion of the
training dataset is gradually increased to test the performance of
the learning system. The true labels of the testing set are compared
with the predicted labels to evaluate the performance of the
system. We select the latter method to evaluate the MIHC
method. To measure the performance accurately, the receiver
operating characteristic (ROC) curve and the area under the ROC
curve (AUC) are used to quantify the results.
Classifications are often based on continuous scores, so
the predicted class of a sample depends on a
threshold parameter. That is, the true and false positive
rates (TPR and FPR, respectively) vary with the threshold.
The ROC curve parametrically plots the TPR versus
the FPR as the threshold varies. The TPR and FPR are
calculated by Equations (7) and (8):

$$ TPR = \frac{TP}{TP + FN} \tag{7} $$

$$ FPR = \frac{FP}{FP + TN} \tag{8} $$

where TP, FP, TN, and FN represent the numbers of true positive,
false positive, true negative, and false negative
predictions, respectively. Therefore, the TPR and FPR reflect
the sensitivity and specificity of prediction (specificity equals 1 - FPR). The AUC is calculated to
summarize the ROC curve. A reliable and valid AUC
estimate can be interpreted as the probability that the classifier will
assign a higher score to a randomly chosen positive sample
than to a randomly chosen negative sample.
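
Equations (7) and (8) and the probabilistic reading of the AUC can be checked with a short sketch. The confusion counts are the first-experiment test row for n% = 10% in Table 4; the classifier scores passed to `auc` are hypothetical toy values, and counting ties as one half is a common convention assumed here.

```python
def rates(tp, fp, tn, fn):
    """Return (TPR, FPR) as defined in Eqs. (7) and (8)."""
    return tp / (tp + fn), fp / (fp + tn)


def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive sample
    receives the higher score; ties count as one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Table 4, n% = 10%, first experiment: TP = 19, FP = 45, TN = 162, FN = 25.
tpr, fpr = rates(tp=19, fp=45, tn=162, fn=25)

# Hypothetical scores for three positive and two negative samples.
toy_auc = auc([0.9, 0.8, 0.4], [0.7, 0.3])
```

For this single repetition, the TPR is 19/44 and the FPR is 45/207. In the toy example, five of the six positive-negative pairs are ranked correctly, so the AUC is 5/6.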
Experiment analysis

The four yeast time-course expression datasets are as follows:
alpha, cdc15, cdc28, and elution, which record the mRNA levels
at 18, 24, 17, and 14 time points, respectively, across the whole cell cycle under
different circumstances. For each expression dataset, the
GNC (λ = 10, 20, 30, 40, 50), GOLC (ℓ = 1, 2, 3, 4), and MIHC
methods are used to establish learning systems, and
their performances are compared. The rationale for these
parameter values is as follows. First, we want to
determine whether different numbers and different levels of gene
groups remarkably change function prediction. Second, for the
GOLC method, the error rate of a given level is accumulated if a
deeper level gene function is required.

The number of genes in the MIHC learning system is consistent
with the non-noise system, but other learning systems cannot
maintain this feature. Table 1 shows the number of genes and
classes for each learning system. The MIHC learning system also
has better class features than other learning systems.

The MIHC learning system is tested with the MLSVM and MLKNN
classifiers. In the classification task, the MLL task is decomposed
into a series of binary classification tasks. However, the negative
samples far outnumber the positive samples for each class.
Therefore, the class imbalance problem should be considered. Further
information about the numbers of positive and negative samples in
the cdc28 and elution experiment datasets is shown in Table S1
and Table S2 in the Supporting Information section. The training
samples have to be balanced, that is, the same numbers of negative
and positive samples are used for the training set. For
each class, we randomly select n% of the positive samples and the
same number of negative samples as the training set, and the rest
form the testing set. The value of n% is increased from 10% to
90%. If the number of positive samples in one class is very low (less
than 10), the number of positive samples in the training set is
increased gradually. The experiment is repeated 20 times (or
more, until the mean value shows minimal changes) for each n%, and
the mean value is calculated. Given the class imbalance, a high
accuracy can still be obtained when the classifier assigns all the
samples to the negative class. Therefore, in this study, AUC is used to evaluate the
performance of MIHC. We compare MIHC with GOLC and
GNC. For each expression dataset, the average results obtained
from each learning system by the SVM and KNN classifiers are shown
in Figures 4 and 5. Tables 2 and 3 show the results for the cdc28
dataset. As n% increases, the AUC value of MIHC increases
markedly (from 0.621 to 0.838 with MLSVM, Table 2), whereas those of GOLC and GNC increase slowly.
These results suggest that, in general, the classes in the MIHC
learning system are more informative and the genes therein are
more strongly correlated than those in the classes of the
other two learning systems. This result can be explained as follows.
Genes are transcribed into mRNA, which is then translated into proteins. To a
certain extent, the level of mRNA reflects the amount of
protein being generated. However, this amount may be influenced
by several factors, such as the rate of mRNA degradation
and the switching off of proteins. Cells are so efficient that
only the necessary proteins are synthesized. Therefore, variations in
gene expression match the activity level of the biological process. GNC
and GOLC cluster GO terms by up-propagating them along the GO
DAG. Meanwhile, the MIHC method treats the gene expression
profile as the feature of GO and clusters GO terms accordingly, which leads to superior
performance. Moreover, when GO is further up-propagated, the
information that reflects the correlation between genes may be
lost. The GO dataset only determines which genes own which GO
terms, not whether the gene exerts a certain function of the GO in
the experiment dataset. However, we assume that genes exert all
their GO functions because the datasets in our study consist of cell cycle
expression data. Compared with GNC and GOLC, MIHC relies on
statistical correlation. Consequently, MIHC is less concerned
about whether or not the gene exerts the function. This problem
will be considered in a future study.
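
The balanced leave-a-percent-out split described above can be sketched as follows. The rounding rule (round half up) is an assumption; it was chosen because, for the 49 positive samples of the class in Table 4, it reproduces the training sizes listed there (5, 10, 15, 20, 25, 29, 34, 39, 44). The special handling of classes with fewer than 10 positive samples is not modeled, and the sample identifiers are hypothetical.

```python
import random


def balanced_split(positives, negatives, n_percent, rng):
    """Use n% of the positives plus equally many negatives for training;
    all remaining samples form the testing set."""
    # Integer round-half-up of len(positives) * n_percent / 100 (assumed rule).
    k = (len(positives) * n_percent + 50) // 100
    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    train = [(s, 1) for s in pos[:k]] + [(s, -1) for s in neg[:k]]
    test = [(s, 1) for s in pos[k:]] + [(s, -1) for s in neg[k:]]
    return train, test


rng = random.Random(0)
positives = list(range(49))            # 49 positive samples, as in Table 4
negatives = list(range(100, 312))      # 212 negative samples, as in Table 4
train, test = balanced_split(positives, negatives, 10, rng)
```

With n% = 10%, the training set holds 5 positive and 5 negative samples and the testing set holds 44 positive and 207 negative samples, matching the first block of Table 4. Repeating the call with different seeds gives the repetitions over which the AUC is averaged.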

Lastly, to illustrate the behavior on a real-world
problem, the ROC curve of a class obtained from the MIHC
learning system for the cdc28 dataset is shown in Figure 6. The
results for the TP, FP, TN, and FN are shown in Table 4 (given
that the experiment is repeated 20 times, only the middle-level
results of 2 repetitions are shown in Table 4; the average TPR
and FPR over the 20 repetitions of the experiment are shown in
Table 5). In Figure 6, the ROC curves of the 20 repetitions of the
experiment are displayed in four subplots (a), (b), (c), and (d) with
parameter n% = 20%, 40%, 60%, and 80%, respectively.
As n% increases, the number of positive samples in the
testing set decreases. The classifier sometimes pays a greater price
to identify as many positive samples in the testing set as possible.
The sample distribution in the training set may also influence the
prediction result. The ROC curve in subplot (d) occasionally
exhibits unsatisfactory performance. The ROC curves of all the
datasets are presented in Figures S1 to S8, and the average TPR
and FPR of all the datasets are shown in Tables S3 to S10 in the
Supporting Information section.

Conclusion 

In this paper, we propose the MIHC method to establish a
learning system, which is verified by SVM and KNN using four
yeast gene expression datasets. In the MIHC method, Pearson
correlation defines the distance between multi-instance samples, and
HC is used to cluster the samples. Compared with other learning
system establishment methods, the MIHC learning system exhibits
better performance because the samples are more easily recognized.
This method also maintains data integrity with the non-noise
system. To our knowledge, this study is the first to use the HC
algorithm to cluster multi-instance samples.

Supporting information 

Figure S1 ROC curves are obtained from cdc28 dataset
by MLSVM. The ROC curves of each learning system,
generated by average TPR and FPR, as well as the four subplots
(a), (b), (c), and (d) with parameter n% = 20%, 40%, 60%, and
80%, respectively, are shown.
(TIF)

Figure S2 ROC curves are obtained from cdc28 dataset
by MLKNN. The ROC curves of each learning system,
generated by average TPR and FPR, as well as the four subplots
(a), (b), (c), and (d) with parameter n% = 20%, 40%, 60%, and
80%, respectively, are presented.
(TIF)

Figure S3 ROC curves are obtained from cdc15 dataset
by MLSVM. The ROC curves of each learning system,
generated by average TPR and FPR, as well as the four subplots
(a), (b), (c), and (d) with parameter n% = 20%, 40%, 60%, and
80%, respectively, are shown.
(TIF)

Figure S4 ROC curves are obtained from cdc15 dataset
by MLKNN. The ROC curves of each learning system,
generated by average TPR and FPR, as well as the four subplots
(a), (b), (c), and (d) with parameter n% = 20%, 40%, 60%, and
80%, respectively, are displayed.
(TIF)

Figure S5 ROC curves are obtained from alpha dataset
by MLSVM. The ROC curves of each learning system,
generated by average TPR and FPR, as well as the four subplots
(a), (b), (c), and (d) with parameter n% = 20%, 40%, 60%, and
80%, respectively, are displayed.
(TIF)

Figure S6 ROC curves are obtained from alpha dataset
by MLKNN. The ROC curves of each learning system,
generated by average TPR and FPR, as well as the four subplots
(a), (b), (c), and (d) with parameter n% = 20%, 40%, 60%, and
80%, respectively, are presented.
(TIF)

Figure S7 ROC curves are obtained from elution
dataset by MLSVM. The ROC curves of each learning system,
generated by average TPR and FPR, as well as the four subplots
(a), (b), (c), and (d) with parameter n% = 20%, 40%, 60%, and
80%, respectively, are shown.
(TIF)

Figure S8 ROC curves are obtained from elution
dataset by MLKNN. The ROC curves of each learning system,
generated by average TPR and FPR, as well as the four subplots
(a), (b), (c), and (d) with parameter n% = 20%, 40%, 60%, and
80%, respectively, are displayed.
(TIF)

Table S1 Number of positive and negative samples in
MIHC from the cdc28 dataset.
(XLS)

Table S2 Number of positive and negative samples in
MIHC from the elution dataset.
(XLS)